Upgrade to Datafusion 51 #1311
Conversation
```
"List(nullable Int64)",
```
Lists have a new render format: apache/arrow-rs#8290
```diff
 (
     f.regexp_replace(column("a"), literal("(ell|orl)"), literal("-")),
-    pa.array(["H-o", "W-d", "!"]),
+    pa.array(["H-o", "W-d", "!"], type=pa.string_view()),
```
regexp_replace now uses UTF8View: apache/datafusion#17195
```diff
 impl PyCatalog {
     #[new]
-    fn new(catalog: PyObject) -> Self {
+    fn new(catalog: Py<PyAny>) -> Self {
```
PyObject has been deprecated: https://pyo3.rs/main/migration.html#deprecation-of-pyobject-type-alias
```diff
 )))?;

-Python::with_gil(|py| {
+Python::attach(|py| {
```
```diff
 let streams = spawn_future(py, async move { df.execute_stream_partitioned().await })?;

-let mut schema: Schema = self.df.schema().to_owned().into();
+let mut schema: Schema = self.df.schema().to_owned().as_arrow().clone();
```
```diff
 #[allow(clippy::too_many_arguments)]
 #[new]
-#[pyo3(signature = (schema, name, location, file_type, table_partition_cols, if_not_exists, temporary, order_exprs, unbounded, options, constraints, column_defaults, definition=None))]
+#[pyo3(signature = (schema, name, location, file_type, table_partition_cols, if_not_exists, or_replace, temporary, order_exprs, unbounded, options, constraints, column_defaults, definition=None))]
```
```diff
 impl PyPrepare {
     #[new]
-    pub fn new(name: String, data_types: Vec<PyDataType>, input: PyLogicalPlan) -> Self {
+    pub fn new(name: String, fields: Vec<PyArrowType<Field>>, input: PyLogicalPlan) -> Self {
```
Prepare now stores fields: apache/datafusion#17986
```diff
 use datafusion::functions_aggregate::all_default_aggregate_functions;
 use datafusion::functions_window::all_default_window_functions;
 use datafusion::logical_expr::expr::{Alias, FieldMetadata, WindowFunction, WindowFunctionParams};
-use datafusion::logical_expr::sqlparser::ast::NullTreatment as DFNullTreatment;
```
NullTreatment has been moved so it no longer relies on sqlparser: apache/datafusion#17332
Thank you for getting started on this! I was hoping we might delay slightly until DF 51.1.0 releases (apache/datafusion#18843). Do you think it's okay to wait a bit before release? Either way we can get this PR in and then update the lock.
Yeah no problem! Meanwhile it appears that the actions are stuck on the

I don't know what might be causing the

From a little investigation I suspect that the I wonder if the new

@nuno-faria
Further investigation. This diff

```diff
diff --git a/src/utils.rs b/src/utils.rs
index 6cc9626..251eb41 100644
--- a/src/utils.rs
+++ b/src/utils.rs
@@ -84,7 +84,10 @@ where
         tokio::select! {
             res = &mut fut => break Ok(res),
             _ = sleep(INTERVAL_CHECK_SIGNALS) => {
-                Python::attach(|py| py.check_signals())?;
+                Python::attach(|py| {
+                    let threading = py.import("threading")?;
+                    let _current_thread = threading.call_method0("current_thread")?;
+                    py.check_signals()})?;
             }
         }
     }
```

then, instead of a plain keyboard interrupt, now gives a DataFusion error wrapping a KeyboardInterrupt. I don't think it's important that the above code is calling the import or the call_method; I think the relevant bit is that the second call takes time within the attached state.
I was just looking into this and did a similar change, just to check whether it is executing in the main thread:

```rust
tokio::select! {
    res = &mut fut => break Ok(res),
    _ = sleep(INTERVAL_CHECK_SIGNALS) => {
        Python::attach(|py| {
            let threading = PyModule::import(py, "threading")?;
            let get_ident = threading.getattr("get_ident")?;
            let thread_id: u64 = get_ident.call0()?.extract()?;
            println!("thread_id: {}", thread_id);
            let x = py.check_signals();
            println!("x: {:?}", x);
            x
        })?;
    }
}
```

Looking at the output, we can see that the thread ids match. I even delayed the signal sent by the

However, manually sending CTRL-C results in the exception being returned by the

I also checked if it was somehow an issue with
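The main-thread check above matters because CPython only ever runs Python-level signal handlers in the main thread of the main interpreter, which is the behavior `py.check_signals()` relies on. A stdlib-only sketch of that constraint (the handler and list names are illustrative, not from this repo):

```python
import signal
import threading

seen = []

def handler(signum, frame):
    # CPython delivers Python signal handlers only in the main thread
    seen.append(threading.current_thread() is threading.main_thread())

old = signal.signal(signal.SIGINT, handler)
try:
    # raise_signal sends SIGINT to this process; the handler runs
    # at the next bytecode boundary, before the try block exits
    signal.raise_signal(signal.SIGINT)
finally:
    signal.signal(signal.SIGINT, old)
```

After the block runs, `seen` holds a single `True`: the handler fired, and it fired on the main thread.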
I have a suspicion that we're uncovering a feature that worked in pyo3 0.25, but not as intended. I suspect we're getting a keyboard exception raised in the record batch reader rather than in the
I checked and in the previous version the

But it is still odd that manually sending an interrupt with CTRL-C works but using
I've done some more testing. I went in and removed the

I do think we've found that the signal catching isn't working exactly like we thought it was. But I also think it's worth taking a look at the changelogs for arrow and datafusion.

Are you sure this isn't during the first call to
I am tempted to take a stab at redoing our runtime and using a CPU executor similar to this example: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/query_planning/thread_pools.rs The idea would be that all datafusion work gets shipped over to a dedicated threadpool to keep the Python thread lighter, but I am not sure if that's worth it. I may be over-analyzing a problem that isn't much of an actual problem. I would like to figure out exactly why this isn't working, though.
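The dedicated-threadpool idea can be sketched with the Python stdlib: blocking work runs on a worker pool while the main thread polls with a timeout, which is exactly the spot where a `check_signals`-style call would sit. The names below are illustrative, not the actual datafusion-python runtime:

```python
import concurrent.futures
import time

def blocking_query():
    # stand-in for work that would run on the dedicated executor
    time.sleep(0.05)
    return 42

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(blocking_query)
    while True:
        try:
            # short timeout keeps the main thread responsive
            result = fut.result(timeout=0.01)
            break
        except concurrent.futures.TimeoutError:
            # the real loop would call py.check_signals() here, so a
            # pending KeyboardInterrupt is noticed between polls
            pass
```

This keeps the main thread free to react to signals while the query makes progress elsewhere, which is the same separation the thread_pools.rs example draws between the CPU runtime and the caller.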
I added a small

```diff
 impl Iterator for PartitionedDataFrameStreamReader {
     type Item = Result<RecordBatch, ArrowError>;

     fn next(&mut self) -> Option<Self::Item> {
         while self.current < self.streams.len() {
             let stream = &mut self.streams[self.current];
             let fut = poll_next_batch(stream);
             let result = Python::with_gil(|py| wait_for_future(py, fut));
+            println!("result: {:?}", result);
             match result {
```

This is the result now in DF51 (also with the previous debug info in

This is the result in DF50 (prints 2992

I also removed the interrupt signal in the
Updated wait_for_future to surface pending Python exceptions by executing bytecode during signal checks, ensuring that asynchronous interrupts are processed promptly. Enhanced PartitionedDataFrameStreamReader to cancel remaining partition streams on projection errors or Python interrupts, allowing for clean iteration stops. Added regression tests to validate interrupted Arrow C stream reads and improve timing for RecordBatchReader.read_all cancellations.
…g and readability
…k for KeyboardInterrupt
This reverts commit 784929d.
hi @nuno-faria
Improve KeyboardInterrupt handling during Arrow C stream reads in Python bindings
Thanks @kosiew, I merged the PR. Just a side note in the

```python
def read_stream():
    ...
    try:
        ...
    except KeyboardInterrupt:
        read_exception.append(KeyboardInterrupt)
    except Exception as e:
        read_exception.append(e)
```

The
I believe I have found the root cause of the change in behavior of the interrupt unit test: apache/datafusion#17452

You can test yourself if you go to failing_test.txt

I think this note is relevant from the PR:

```rust
/// This method is async and uses a [`tokio::sync::Barrier`] to wait for all partitions
/// to report their bounds. Once that occurs, the method will resolve for all callers and the
/// dynamic filter will be updated exactly once.
///
/// # Note
///
/// As barriers are reusable, it is likely an error to call this method more times than the
/// total number of partitions - as it can lead to pending futures that never resolve. We rely
/// on correct usage from the caller rather than imposing additional checks here. If this is a concern,
/// consider making the resulting future shared so the ready result can be reused.
```

I haven't looked much at tokio barriers, but it seems like this is somehow interfering with how we are trying to interrupt the task. I see you have a fix that updates the unit test. That is good and probably unblocks this PR, but I do want to get to the very root of this problem and make sure we have a robust solution. Tagging @rkrishn7 and @adriangb just so they know we're talking about the work they put in upstream.
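The hazard that note describes — a reusable barrier plus one extra waiter producing a future that never resolves — has a direct stdlib analogue in Python's `threading.Barrier`, which likewise resets for a new generation once all parties arrive. A small sketch of that reuse behavior (the "partition" framing here is just an analogy to the upstream code):

```python
import threading

barrier = threading.Barrier(2)  # two "partitions" must report their bounds
done = []

def partition_worker():
    barrier.wait()
    done.append(True)

t = threading.Thread(target=partition_worker)
t.start()
barrier.wait()  # second party arrives: both are released, barrier resets
t.join()

# An extra wait() after the reset starts a NEW generation with only one
# waiter, which would block forever; a timeout breaks the barrier instead.
try:
    barrier.wait(timeout=0.05)
    broken = False
except threading.BrokenBarrierError:
    broken = True
```

That lone extra `wait()` is the same shape as calling the upstream method more times than there are partitions: the surplus caller ends up parked in a generation that never fills, which is plausibly where an interrupt gets stuck.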
Co-authored-by: Tim Saucer <timsaucer@gmail.com>
Which issue does this PR close?

N/A.

Rationale for this change

Upgrade to the latest version of datafusion.

What changes are included in this PR?

Fixed breaking changes.

Are there any user-facing changes?

No.